an option to download only added files for a given dataset version #1100
base: master
Conversation
I believe that this will also download the modified files (which is good), but maybe the name
@eugen-ajechiloae-clearml done, I renamed it
```python
unified_list = set()
for ds_id in datasets:
    dataset = self.get(dataset_id=ds_id)
    unified_list |= set(dataset._dataset_file_entries.values())
```
@kirillfish This line seems to break:

```python
unified_list |= set(dataset._dataset_file_entries.values())
```

```
TypeError: unhashable type: 'FileEntry'
```

Did you mean to construct the set based on the ids? Like:

```python
unified_list |= {entry.id for entry in dataset._dataset_file_entries.values()}
```
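For context, a minimal sketch of why the original line raises: a plain `@dataclass` (which defines `__eq__` but is not frozen) gets `__hash__` set to `None`, so its instances cannot go into a set. The `FileEntry` below is a hypothetical stand-in for ClearML's class, not its actual definition, but it reproduces the same failure mode and shows the id-based workaround:

```python
from dataclasses import dataclass


# Hypothetical stand-in: defining __eq__ without frozen=True
# sets __hash__ = None, making instances unhashable.
@dataclass
class FileEntry:
    id: str
    size: int


entries = [FileEntry("a", 1), FileEntry("b", 2), FileEntry("a", 1)]

try:
    unified = set(entries)  # fails, as in the review comment
except TypeError as e:
    print(e)  # unhashable type: 'FileEntry'

# Deduplicating on a hashable key instead works fine:
unified_ids = {entry.id for entry in entries}
print(sorted(unified_ids))  # ['a', 'b']
```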
@kirillfish btw, here's a simple test snippet (you'll need to adapt it a little) to check that it works properly.
```python
from pathlib import Path
import shutil

from clearml import Dataset, StorageManager


def main():
    manager = StorageManager()

    print("STEP1 : Downloading mnist dataset")
    mnist_dataset = Path(manager.get_local_copy(
        remote_url="https://allegro-datasets.s3.amazonaws.com/datasets/MNIST.zip", name="MNIST"))
    mnist_dataset_train = mnist_dataset / "TRAIN"
    mnist_dataset_test = mnist_dataset / "TEST"

    print("STEP2 : Creating the training dataset")
    mnist_dataset = Dataset.create(
        dataset_project="dataset_examples", dataset_name="MNIST Training Dataset")
    mnist_dataset.add_files(path=mnist_dataset_train, dataset_path="TRAIN")
    mnist_dataset.upload()
    mnist_dataset.finalize()

    print("STEP3 : Create a child dataset of mnist dataset using TEST Dataset")
    child_dataset = Dataset.create(
        dataset_project="dataset_examples", dataset_name="MNIST Complete Dataset", parent_datasets=[mnist_dataset.id])
    child_dataset.add_files(path=mnist_dataset_test, dataset_path="TEST")
    child_dataset.upload()
    child_dataset.finalize()

    print("We are done, have a great day :)")


main()

# Full copy: downloads the child dataset AND its parent
dataset_path = Dataset.get("<replace with child dataset id, i.e. MNIST Complete Dataset ID>").get_local_copy()
print(list(Path(dataset_path).glob("*")))
shutil.rmtree(dataset_path)  # note: os.rmtree does not exist; shutil.rmtree is the correct call

# Diff only: skips the files inherited from the parent dataset
dataset_path = Dataset.get("<replace with child dataset id, i.e. MNIST Complete Dataset ID>").get_local_copy(ignore_parent_datasets=True)
print(list(Path(dataset_path).glob("*")))
```
@kirillfish any news on this?
Hi @kirillfish , any update?
So far, when calling the `get_local_copy` method on a `clearml.Dataset` object, you would download all the files from the dataset AND all its parents recursively, or create soft links for all files that have been downloaded previously. But there was no way to get only the files added to this particular version of the dataset, ignoring all the parents. This little PR implements this exact feature.

Testing Instructions

- Create a `parent` dataset and add some files
- Create a `child` dataset, inherited from the first one, and add some more files
- Call `get_local_copy` with the `only_added` argument (`False` by default): `data_base_dir` will contain only the files returned by `list_added_files`

Other Information

Potential issue:

- Soft links are still being created for files in the diff which have already been downloaded - this is ok
- But if you first call `child.get_local_copy(only_added=True)` and then once again `child.get_local_copy()`, it will not create soft links for existing files and will download the diff once again -- probably not ok. The same applies to "grandchildren" datasets. Still figuring out why. On the other hand, this could be ok if we assume the `only_added=True` flag is supposed to be used only for debug purposes or to quickly inspect datasets.
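The core idea of the feature (a local copy containing only this version's additions) can be sketched in plain Python as a set difference over file listings. `FileEntry`, `parent_entries`, and `child_entries` below are hypothetical stand-ins for illustration, not ClearML's actual internals:

```python
from dataclasses import dataclass


# Hypothetical minimal stand-in for one file in a dataset version listing.
@dataclass(frozen=True)
class FileEntry:
    relative_path: str
    hash: str


# A child version inherits its parent's files and adds new ones.
parent_entries = {
    "TRAIN/0.png": FileEntry("TRAIN/0.png", "h0"),
    "TRAIN/1.png": FileEntry("TRAIN/1.png", "h1"),
}
child_entries = {
    **parent_entries,
    "TEST/0.png": FileEntry("TEST/0.png", "h2"),
}


def list_added_paths(child, parent):
    """Return only the paths introduced by the child version."""
    return sorted(set(child) - set(parent))


print(list_added_paths(child_entries, parent_entries))  # ['TEST/0.png']
```

With `only_added=True`, only these diff entries would be fetched; the default behavior would materialize the full union of child and parent entries.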